
Dr Ignaz Semmelweis was a Hungarian physician, born in 1818, who worked at the Vienna General Hospital. In the past, people thought illness was caused by "bad air" or evil spirits. But in the 1800s, doctors started looking more closely at anatomy, performing autopsies, and making arguments based on data. Dr Semmelweis suspected that something was going wrong with the procedures at the Vienna General Hospital. He wanted to figure out why so many women in maternity wards were dying from childbed fever (i.e., puerperal fever).

Today you will become Dr Semmelweis. This is your office 👆. You will step into Dr Semmelweis' shoes and analyse the same data collected from 1841 to 1849.
Dr Semmelweis published his research in 1861. I found the scanned pages of the full text with the original tables in German, but an excellent English translation can be found here.

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
# %pip install --upgrade plotly
import plotly
plotly.__version__
'5.10.0'
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
pd.options.display.float_format = '{:,.2f}'.format
# Create locators for ticks on the time axis
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
df_yearly = pd.read_csv('annual_deaths_by_clinic.csv')
# parse_dates avoids DateTime conversion later
df_monthly = pd.read_csv('monthly_deaths.csv',
                         parse_dates=['date'])
Challenge: Check out these two DataFrames ☝️.
df_yearly
| | year | births | deaths | clinic |
|---|---|---|---|---|
| 0 | 1841 | 3036 | 237 | clinic 1 |
| 1 | 1842 | 3287 | 518 | clinic 1 |
| 2 | 1843 | 3060 | 274 | clinic 1 |
| 3 | 1844 | 3157 | 260 | clinic 1 |
| 4 | 1845 | 3492 | 241 | clinic 1 |
| 5 | 1846 | 4010 | 459 | clinic 1 |
| 6 | 1841 | 2442 | 86 | clinic 2 |
| 7 | 1842 | 2659 | 202 | clinic 2 |
| 8 | 1843 | 2739 | 164 | clinic 2 |
| 9 | 1844 | 2956 | 68 | clinic 2 |
| 10 | 1845 | 3241 | 66 | clinic 2 |
| 11 | 1846 | 3754 | 105 | clinic 2 |
df_monthly
| | date | births | deaths |
|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 |
| 1 | 1841-02-01 | 239 | 18 |
| 2 | 1841-03-01 | 277 | 12 |
| 3 | 1841-04-01 | 255 | 4 |
| 4 | 1841-05-01 | 255 | 2 |
| ... | ... | ... | ... |
| 93 | 1848-11-01 | 310 | 9 |
| 94 | 1848-12-01 | 373 | 5 |
| 95 | 1849-01-01 | 403 | 9 |
| 96 | 1849-02-01 | 389 | 12 |
| 97 | 1849-03-01 | 406 | 20 |
98 rows × 3 columns
print(f"Our yearly data has {df_yearly.shape[1]} \
columns and {df_yearly.shape[0]} rows.")
print(f"Our monthly data has {df_monthly.shape[1]} \
columns and {df_monthly.shape[0]} rows.")
Our yearly data has 4 columns and 12 rows.
Our monthly data has 3 columns and 98 rows.
df_yearly.isna().any().any()
False
df_monthly.isna().any().any()
False
We don't have any NaN values.
df_yearly.duplicated().any()
False
df_monthly.duplicated().any()
False
No duplicate data either.
df_yearly.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    12 non-null     int64
 1   births  12 non-null     int64
 2   deaths  12 non-null     int64
 3   clinic  12 non-null     object
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes
df_monthly.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    98 non-null     datetime64[ns]
 1   births  98 non-null     int64
 2   deaths  98 non-null     int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 2.4 KB
Numbers and dates already in the right data types 👌
df_yearly.describe()
| year | births | deaths | |
|---|---|---|---|
| count | 12.00 | 12.00 | 12.00 |
| mean | 1,843.50 | 3,152.75 | 223.33 |
| std | 1.78 | 449.08 | 145.38 |
| min | 1,841.00 | 2,442.00 | 66.00 |
| 25% | 1,842.00 | 2,901.75 | 100.25 |
| 50% | 1,843.50 | 3,108.50 | 219.50 |
| 75% | 1,845.00 | 3,338.25 | 263.50 |
| max | 1,846.00 | 4,010.00 | 518.00 |
df_monthly.describe()
| births | deaths | |
|---|---|---|
| count | 98.00 | 98.00 |
| mean | 267.00 | 22.47 |
| std | 41.77 | 18.14 |
| min | 190.00 | 0.00 |
| 25% | 242.50 | 8.00 |
| 50% | 264.00 | 16.50 |
| 75% | 292.75 | 36.75 |
| max | 406.00 | 75.00 |
We have an average of 223 deaths per year and 22 deaths per month, and an average of 3,153 births per year and 267 births per month.
Challenge: How dangerous was childbirth in the 1840s in Vienna?
In comparison, the United States recorded 18.5 maternal deaths per 100,000 births, or about 0.0185%, in 2013 (source).
death_year_pct = df_yearly.deaths.sum()/df_yearly.births.sum()
print(f"In the 1840s in Vienna, {death_year_pct*100:.0f}% \
of women giving birth died.")
In the 1840s in Vienna, 7% of women giving birth died.
Challenge: Create a Matplotlib chart with twin y-axes. It should look something like this.
Use skyblue and crimson for the line colours.
# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_formatter = mdates.DateFormatter('%Y')
plt.figure(figsize=(14,8), dpi=200)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Total Number of Monthly Births and Deaths", fontsize=18)
plt.xlabel('Time', fontsize=16)
plt.ylabel('Number of Births', fontsize=16)
plt.xlim(df_monthly.date.min(), df_monthly.date.max())
ax1 = plt.gca()
ax2 = plt.twinx()
ax1.plot('date', 'births', data=df_monthly, c='skyblue', linewidth=3)
ax2.plot('date', 'deaths', data=df_monthly, c='crimson', linewidth=2,
linestyle='--')
ax2.set_ylabel('Number of Deaths', fontsize=16)
plt.setp(ax2.get_yticklabels(), fontsize=14)
ax1.grid(color='grey', linestyle='--')
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_formatter)
ax1.xaxis.set_minor_locator(months)
plt.show()
Now let's look at the annual data instead.
Challenge: Use plotly to create line charts of the births and deaths of the two different clinics at the Vienna General Hospital.
fig = px.line(df_yearly, x='year', y='births', color='clinic',
height=600, title='Total Yearly Births by Clinic',
labels={'year': 'Time', 'births': 'Number of Births',
'deaths': 'Number of Deaths', 'clinic': 'Clinic'})
fig.show()
fig = px.line(df_yearly, x='year', y='deaths', color='clinic',
height=600, title='Total Yearly Deaths by Clinic',
labels={'year': 'Time', 'births': 'Number of Births',
'deaths': 'Number of Deaths', 'clinic': 'Clinic'})
fig.show()
Challenge: Calculate the proportion of maternal deaths per clinic. That way we can compare like with like.
Extend the df_yearly DataFrame by adding a column called "pct_deaths".
pct_deaths = df_yearly.deaths / df_yearly.births
df_yearly.insert(loc=3, column='pct_deaths', value=pct_deaths)
df_yearly.head(3)
| year | births | deaths | pct_deaths | clinic | |
|---|---|---|---|---|---|
| 0 | 1841 | 3036 | 237 | 0.08 | clinic 1 |
| 1 | 1842 | 3287 | 518 | 0.16 | clinic 1 |
| 2 | 1843 | 3060 | 274 | 0.09 | clinic 1 |
total_deaths_clinic1 = df_yearly.deaths.loc[df_yearly.clinic=='clinic 1'].sum()
total_deaths_clinic2 = df_yearly.deaths.loc[df_yearly.clinic=='clinic 2'].sum()
total_births_clinic1 = df_yearly.births.loc[df_yearly.clinic=='clinic 1'].sum()
total_births_clinic2 = df_yearly.births.loc[df_yearly.clinic=='clinic 2'].sum()
avg_deaths_clinic1 = total_deaths_clinic1 / total_births_clinic1
avg_deaths_clinic2 = total_deaths_clinic2 / total_births_clinic2
display(avg_deaths_clinic1)
display(avg_deaths_clinic2)
0.09924159265542361
0.038839862852003824
We could also have just done the average of the % column.
display(df_yearly.pct_deaths.loc[df_yearly.clinic=='clinic 1'].mean())
display(df_yearly.pct_deaths.loc[df_yearly.clinic=='clinic 2'].mean())
0.0985052720217127
0.04039993689236435
Actually no, this doesn't work. In this case, I shouldn't average the percentages, because the sample size of each row is different: one year has 4,000 births and another has 3,000, so the rows are not equal. Each row's percentage carries a different weight.
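To see why, here's a minimal sketch with made-up numbers: the unweighted mean of the row percentages disagrees with the pooled (birth-weighted) rate whenever the group sizes differ.

```python
import pandas as pd

# Hypothetical numbers: two years with very different birth counts
df = pd.DataFrame({'births': [4000, 1000],
                   'deaths': [400, 10]})
df['pct_deaths'] = df.deaths / df.births  # 0.10 and 0.01

naive = df.pct_deaths.mean()                # (0.10 + 0.01) / 2 = 0.055
pooled = df.deaths.sum() / df.births.sum()  # 410 / 5000 = 0.082

print(naive, pooled)
```

The naive mean gives the small year as much weight as the big one; the pooled rate (total deaths over total births) is the one that answers "what fraction of mothers died".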
print(f"The average maternal death rate for "\
      f"clinic 1 is {avg_deaths_clinic1*100:.1f}%")
print(f"The average maternal death rate for "\
      f"clinic 2 is {avg_deaths_clinic2*100:.1f}%")
The average maternal death rate for clinic 1 is 9.9%
The average maternal death rate for clinic 2 is 3.9%
fig = px.line(df_yearly, x='year', y='pct_deaths', color='clinic',
height=600, title='Yearly Maternal Death Rate by Clinic',
labels={'year': 'Time', 'births': 'Number of Births',
'deaths': 'Number of Deaths', 'clinic': 'Clinic',
'pct_deaths': 'Maternal Death Rate'})
fig.show()
The story continues...
At first, Dr Semmelweis thought that the position of the women giving birth was the issue. In clinic 2, the midwives' clinic, women gave birth on their sides. In the doctors' clinic, women gave birth on their backs. So, Dr. Semmelweis, had women in the doctors' clinic give birth on their sides. However, this had no effect on the death rate.
Next, Dr Semmelweis noticed that whenever someone on the ward died, a priest would walk through clinic 1, past the women's beds ringing a bell 🔔 . Perhaps the priest and the bell ringing terrified the women so much after birth that they developed a fever, got sick and died. Dr Semmelweis had the priest change his route and stop ringing the bell 🔕 . Again, this had no effect.
At this point, Dr Semmelweis was so frustrated he went on holiday to Venice. Perhaps a short break would clear his head. When Semmelweis returned from his vacation, he was told that one of his colleagues, a pathologist, had fallen ill and died. His friend had pricked his finger while doing an autopsy on a woman who had died from childbed fever and subsequently got very sick himself and died. 😮
In June 1847, Dr Semmelweis ordered everyone on his medical staff to start cleaning their hands and instruments not just with soap and water but with a chlorine solution (he didn't know it at the time, but chlorine is an amazing disinfectant). The reason Dr Semmelweis actually chose the chlorine was that he wanted to get rid of any smell on doctors' hands after an autopsy. No one knew anything about bacteria, germs or viruses at the time.
# Date when handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')
Challenge:
Add a column to df_monthly that has the percentage of deaths per births for each row. Then create two subsets of the df_monthly data: before and after Dr Semmelweis ordered handwashing.
df_monthly['pct_deaths'] = df_monthly.deaths/df_monthly.births
df_monthly.head(3)
| date | births | deaths | pct_deaths | |
|---|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 | 0.15 |
| 1 | 1841-02-01 | 239 | 18 | 0.08 |
| 2 | 1841-03-01 | 277 | 12 | 0.04 |
df_monthly_before = df_monthly.loc[df_monthly.date < handwashing_start]
df_monthly_after = df_monthly.loc[df_monthly.date >= handwashing_start]
avg_death_prior = df_monthly_before.deaths.sum() / df_monthly_before.births.sum()
avg_death_after = df_monthly_after.deaths.sum() / df_monthly_after.births.sum()
print(f"The average death rate prior to June 1847 is {avg_death_prior*100:.2f}%")
print(f"The average death rate after June 1847 is {avg_death_after*100:.2f}%")
The average death rate prior to June 1847 is 10.53%
The average death rate after June 1847 is 2.15%
The death rate per birth dropped dramatically after handwashing started, from 10.53% to 2.15%. We can use the colon and format specifier inside an f-string to control how many digits of a number get printed.
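For example, the part after the colon in an f-string controls rounding and the number of digits shown:

```python
rate = 0.105258

print(f"{rate*100:.2f}%")  # two decimal places -> 10.53%
print(f"{rate:.4f}")       # four decimal places -> 0.1053
print(f"{rate*100:.0f}%")  # rounded to a whole number -> 11%
```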
Challenge: Create a DataFrame that has the 6 month rolling average death rate prior to mandatory handwashing.
Hint: You'll need to set the dates as the index in order to avoid the date column being dropped during the calculation.
df_monthly_rolling = df_monthly.set_index('date').rolling(window=6).mean()
Wait a minute, I think we're facing the same issue as before when averaging percentages...
For now, if I just display df_monthly_rolling.deaths / df_monthly_rolling.births and df_monthly_rolling.pct_deaths, they look similar though.
If I want to replace an entire column and return the new DataFrame, the .replace(to_replace=, value=) method doesn't work. It only replaces actual values inside a DataFrame, e.g. replacing all ones with twos.
Instead, to replace an entire column, use .assign() with the name of the column as parameter, and the argument being a scalar or a Series. See examples below:
df_monthly_rolling.assign(pct_deaths=1)
df_monthly_rolling.assign(pct_deaths=df_monthly_rolling.deaths / df_monthly_rolling.births)
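Here's a self-contained sketch (toy data) contrasting the two methods:

```python
import pandas as pd

df = pd.DataFrame({'deaths': [1, 2], 'births': [10, 20]})

# .replace() swaps matching values anywhere in the frame
replaced = df.replace(to_replace=1, value=2)
print(replaced.deaths.tolist())      # [2, 2]

# .assign() returns a new DataFrame with the column set (or overwritten)
assigned = df.assign(pct_deaths=df.deaths / df.births)
print(assigned.pct_deaths.tolist())  # [0.1, 0.1]
```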
In pandas (as in NumPy), NaN is not equal to NaN, so NaN != NaN evaluates to True.
To make a comparison where NaN == NaN counts as True, we can test both sides for NaN like this: (a!=a)&(b!=b)
In the end, to compare the values of two Series, where NaN are equal, I can use np.all((a == b) | (np.isnan(a) & np.isnan(b)))
test.pct_deaths == df_monthly_rolling.deaths / df_monthly_rolling.births
a = test.pct_deaths
b = df_monthly_rolling.deaths / df_monthly_rolling.births
(a == b) | (np.isnan(a) & np.isnan(b)) # Return the Series of True or False
np.all((a == b) | (np.isnan(a) & np.isnan(b))) # Return a unique True or False
import numexpr
pd.Series(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))')) # Return the Series of True or False
np.all(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))')) # Return a unique True or False
The .equals() Pandas method:
a==b # Return a Series, where True if values are equal, but False for NaN
(a==b).all() # Return a unique boolean, which is False in this case because of the NaNs
a.equals(b) # Return a unique boolean too, but True in this case, considering NaN equal
The .compare() method returns a DataFrame with the columns self and other to show the differences between the values where they're not equal, e.g. a.compare(b).
pd.options.display.max_rows
60
Using the command pd.describe_option('display') will give us a bunch of information about the current display options, such as:
display.large_repr : 'truncate'/'info'
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited. [...]
[default: 60] [currently: 60]
display.max_columns : int
[...]
display.min_rows : int
The numbers of rows to show in a truncated view (when max_rows is exceeded). [...]
We can change these options, like the max number of rows before truncating: pd.set_option('display.max_rows', 500)
But we can also change the option temporarily:
from IPython.display import display
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
display(df) #need display to show the dataframe when using with in jupyter
#some pandas stuff
from IPython.display import display
a = df_monthly_rolling.pct_deaths
b = df_monthly_rolling.deaths / df_monthly_rolling.births
a.equals(b)
False
It seems that, indeed, the rolling average of the percentages and the percentage computed from the rolling values are not the same. If we want to display the whole Boolean Series resulting from the comparison, we don't get all the rows by default, so we can temporarily change the display options like so (there are 98 rows in the df):
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
display((a == b) | (np.isnan(a) & np.isnan(b)))
And we see that actually only the NaN values are the same (True), all the other values are different (False).
with pd.option_context('display.float_format', '{:,.5f}'.format,
'display.min_rows', 6):
display(a.compare(b))
| self | other | |
|---|---|---|
| date | ||
| 1841-06-01 | 0.05631 | 0.05608 |
| 1841-07-01 | 0.04606 | 0.04379 |
| 1841-08-01 | 0.03576 | 0.03360 |
| ... | ... | ... |
| 1849-01-01 | 0.01630 | 0.01685 |
| 1849-02-01 | 0.02144 | 0.02157 |
| 1849-03-01 | 0.02805 | 0.02844 |
93 rows × 2 columns
By changing the way the floats are displayed, we can see that, although very close, the values are indeed different.
So, we need to reassign the right values.
df_monthly_rolling.pct_deaths = df_monthly_rolling.deaths / df_monthly_rolling.births
df_monthly_before_rolling = df_monthly_before.set_index('date')\
.rolling(window=6).mean()
df_monthly_before_rolling.pct_deaths =\
df_monthly_before_rolling.deaths.divide(df_monthly_before_rolling.births)
Challenge: Copy-paste and then modify the Matplotlib chart from before to plot the monthly death rates (instead of the total number of births and deaths). The chart should look something like this:

Bringing the date back to a column instead of an index works better for the x and y data in the graphs further below:
df_monthly_before_rolling.reset_index(inplace=True) # not "drop=True" here!
plt.figure(figsize=(14,8), dpi=200)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Evolution of Maternal Death Rate", fontsize=18)
plt.xlabel('Time', fontsize=16)
plt.ylabel('Rate of Maternal Death', fontsize=16)
plt.xlim(df_monthly.date.min(), df_monthly.date.max())
plt.plot('date', 'pct_deaths', data=df_monthly_before,
c='black', linewidth=1, linestyle='--')
plt.plot('date', 'pct_deaths', data=df_monthly_before_rolling,
c='crimson', linewidth=3, linestyle='--')
plt.plot('date', 'pct_deaths', data=df_monthly_after,
c='skyblue', linewidth=3, marker='o')
ax = plt.gca()
ax.grid(color='grey', linestyle='--')
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_formatter)
ax.xaxis.set_minor_locator(months)
plt.legend(['Before Handwashing', '6-Month Rolling Average',
            'After Handwashing'], fontsize=16)
plt.show()
Return value of .plot(): it is noteworthy that plt.plot() actually returns a list of Line2D objects (i.e., a list of lines representing the plotted data).
We can then reuse them in some functions, like .legend(), e.g.
ma_line, = plt.plot(...)
bw_line, = plt.plot(...)
aw_line, = plt.plot(...)
plt.legend(handles=[ma_line, bw_line, aw_line],
fontsize=18)
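Here's a minimal, self-contained example (toy data, off-screen backend so it runs anywhere) showing the trailing-comma unpacking:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import matplotlib.pyplot as plt

# plt.plot returns a list of Line2D objects; the trailing comma
# unpacks the single line out of that one-element list
sq_line, = plt.plot([1, 2, 3], [1, 4, 9])
id_line, = plt.plot([1, 2, 3], [1, 2, 3])

plt.legend(handles=[sq_line, id_line],
           labels=['squares', 'identity'], fontsize=12)
print(type(sq_line).__name__)  # Line2D
```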
Challenge: Compare the summary statistics of the monthly data before and after handwashing, and calculate the average monthly death rate for each period.
df_monthly_before.describe()
df_monthly_after.describe()
Note: This time, we do ask for an average of the monthly death rate, so an average of averages.
avg_death_monthly_before = df_monthly_before.pct_deaths.mean()
avg_death_monthly_after = df_monthly_after.pct_deaths.mean()
print(f"The average percentage of monthly deaths before handwashing \
was {avg_death_monthly_before*100:.2f}%")
print(f"The average percentage of monthly deaths after handwashing \
was {avg_death_monthly_after*100:.2f}%")
The average percentage of monthly deaths before handwashing was 10.50% The average percentage of monthly deaths after handwashing was 2.11%
print(f"The chance of dying in childbirth dropped by \
{(avg_death_monthly_before-avg_death_monthly_after)*100:.2f} percentage points")
The chance of dying in childbirth dropped by 8.40 percentage points
The previous death rate was avg_death_prior and the new death rate avg_death_after, so the chance of dying reduced by:
print(f"a factor {avg_death_prior/avg_death_after:.1f}!")
a factor 4.9!
A lot of statistical tests rely on comparing features of distributions, like the mean. We see that the average death rate before handwashing was 10.5%. After handwashing was made obligatory, the average death rate was 2.11%. The difference is massive. Handwashing decreased the average death rate by 8.4 percentage points, roughly a 5x improvement.
Note: be careful with wording. Saying the rate "is 8.4% less" does NOT mean the same thing as "reduced by (or decreased by) 8.4 percentage points".
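A quick numeric check of the distinction, using the two rates above: subtracting the rates gives a drop in percentage points, while dividing the drop by the starting rate gives the relative reduction.

```python
before, after = 0.1050, 0.0211

drop_pp = (before - after) * 100            # absolute drop, in percentage points
drop_rel = (before - after) / before * 100  # relative reduction, in percent

print(f"{drop_pp:.1f} percentage points")    # 8.4 percentage points
print(f"{drop_rel:.0f}% relative reduction") # 80% relative reduction
```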
The statistic above is impressive, but how do we show it graphically? With a box plot we can show how the quartiles, minimum, and maximum values changed in addition to the mean.
Challenge:
Use NumPy's .where() function to add a column to df_monthly that shows if a particular date was before or after the start of handwashing.
Example of np.where():
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
df_monthly['washing_hands'] = np.where(df_monthly.date < handwashing_start, 'No', 'Yes')
df_monthly.sample(2)
| date | births | deaths | pct_deaths | washing_hands | |
|---|---|---|---|---|---|
| 87 | 1848-05-01 | 313 | 3 | 0.01 | Yes |
| 63 | 1846-05-01 | 305 | 41 | 0.13 | No |
fig = px.box(df_monthly, x='washing_hands', y='pct_deaths',
color='washing_hands', height=600,
labels={'washing_hands': 'Washing Hands?',
'pct_deaths': 'Rate of Monthly Maternal Death'},
title='How Have the Stats Changed with Handwashing?')
fig.show()
The plot shows us the same data as our Matplotlib chart, but from a different perspective. Here we also see the massive spike in deaths in late 1842. Over 30% of women who gave birth that month died in hospital. What we also see in the box plot is how not only did the average death rate come down, but so did the overall range - we have a lower max and 3rd quartile too. Let's take a look at a histogram to get a better sense of the distribution.
Challenge: Create a plotly histogram to show the monthly percentage of deaths.
Use the color parameter to display two overlapping histograms. Set histnorm to percent to make the time periods comparable. Experiment with the marginal parameter.
from sympy import divisors, divisor_count
divisor_count(98)
6
divisors(98)
[1, 2, 7, 14, 49, 98]
Using new parameters in Plotly, with histogram: color=, opacity=, barmode=, histfunc=, histnorm=
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
nbins=30, color='washing_hands', opacity=0.6,
barmode='overlay')
# using color divides the histogram in two
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
yaxis_title='Number of Months')
fig.show()
However, the sample of data for "No" (not washing hands) is bigger, so we can't really compare the two histograms by raw count. We'll change histnorm to 'percent' instead, so each bar shows, for example, that 40% of the months in that group had a death rate in that bin.
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
nbins=30, color='washing_hands', opacity=0.6,
barmode='overlay', histnorm='percent')
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
yaxis_title='% of months')
fig.show()
This tells much the same story as the box plot: all the data is concentrated at a low death rate for "Yes" (washing hands), and it's much more spread out, with a higher average, for "No".
We can actually display both of them together, using the marginal= parameter.
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
nbins=30, color='washing_hands', opacity=0.6,
barmode='overlay', histnorm='percent',
marginal='box', height=500)
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
yaxis_title='Number of Months')
fig.show()
Now, we have only 98 data points, so our histogram looks a bit jagged. It's not a smooth bell-shaped curve. However, we can estimate what the distribution would look like with a Kernel Density Estimate (KDE).
Challenge: Use Seaborn's .kdeplot() to create two kernel density estimates of the pct_deaths, one for before handwashing and one for after.
Use the shade parameter to give your two distributions different colours. Use the clip parameter to address the problem of the KDE extending below zero on the x-axis.
plt.figure(figsize=(10,6), dpi=120)
kde = sns.kdeplot(data=df_monthly, x='pct_deaths', hue='washing_hands',
clip=(0,1), common_norm=False,
fill=True)
kde.set(xlim=(0, 0.4), xlabel='Rate of Monthly Maternal Deaths',
ylabel='Density Distribution')
kde.legend_.set_title('Washing Hands?')
plt.title('Estimated Distribution of Monthly Maternal Death Rate')
plt.show()
The shade parameter is deprecated; it is now fill. This just fills the area under the curve with a colour.
Initially, the KDE goes into the negative on the x-axis, so we use the clip=(0,1) to "not evaluate the density outside of these limits".
Finally, we use common_norm=False to "normalize each density independently", i.e. the area under each curve sums to 1. If True, then it's the "total area under all densities" which "sums to 1", i.e. they're normalized in common.
Now that we have an idea of what the two distributions look like, we can further strengthen our argument for handwashing by using a statistical test. We can test whether our distributions ended up looking so different purely by chance (i.e., the lower death rate is just an accident) or if the 8.4% difference in the average death rate is statistically significant.
Challenge: Use a t-test to determine if the differences in the means are statistically significant or purely due to chance.
If the p-value is less than 1% then we can be 99% certain that handwashing has made a difference to the average monthly death rate.
Import stats from scipy and use the .ttest_ind() function to calculate the t-statistic and the p-value.
What Is a T-Test?
A t-test is an inferential statistic used to determine if there is a significant difference between the means of two groups and how they are related. T-tests are used when the data sets follow a normal distribution and have unknown variances, like the data set recorded from flipping a coin 100 times.
The t-test is a test used for hypothesis testing in statistics and uses the t-statistic, the t-distribution values, and the degrees of freedom to determine statistical significance.
T-Score
A large t-score, or t-value, indicates that the groups are different while a small t-score indicates that the groups are similar.
P-value
Every t-value has a p-value to go with it. A p-value from a t-test is the probability that results like those in your sample data occurred by chance. P-values range from 0% to 100% and are usually written as a decimal (for example, a p-value of 5% is 0.05). Low p-values indicate your results are unlikely to have occurred by chance. For example, a p-value of .01 means there is only a 1% probability that the results from the experiment happened by chance.
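To make the t-statistic less of a black box before we run it on our data, here's a sketch (with synthetic, hypothetical "before" and "after" rates) computing the pooled-variance Student's t by hand and checking it against scipy's default ttest_ind:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.105, 0.03, 76)  # synthetic "before" monthly rates
b = rng.normal(0.021, 0.01, 22)  # synthetic "after" monthly rates

na, nb = len(a), len(b)
# Pooled variance across both groups (equal-variance assumption)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

t_scipy, p_value = stats.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy))  # True
```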
from scipy import stats
t_test = stats.ttest_ind(a=df_monthly.pct_deaths.loc[df_monthly.washing_hands == 'No'],
b=df_monthly.pct_deaths.loc[df_monthly.washing_hands == 'Yes'])
t_test
Ttest_indResult(statistic=5.511607211341916, pvalue=2.985436556724523e-07)
print(f"{t_test.pvalue:.10f}")
0.0000002985
The p-value is incredibly small! It is vanishingly unlikely that these results occurred by chance.
We could also use the ttest_ind like so:
t_stat, p_value = stats.ttest_ind(a=df_monthly_before.pct_deaths,
                                  b=df_monthly_after.pct_deaths)
print(f'p-value is {p_value:.10f}')
print(f't-statistic is {t_stat:.4}')
p-value is 0.0000002985
t-statistic is 5.512
When we calculate the p-value, we see that it is 0.0000002985, or 0.00002985%, which is far below even 1%. In other words, the difference in means is highly statistically significant, and we can go ahead and publish our research paper.
What do you conclude from your analysis, Doctor? 😊
